Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks
Recent research suggests that the feed-forward module within Transformers can
be viewed as a collection of key-value memories, where the keys learn to
capture specific patterns from the input based on the training examples. The
values then combine the output from the 'memories' of the keys to generate
predictions about the next token. This leads to an incremental process of
prediction that gradually converges towards the final token choice near the
output layers. This interesting perspective raises questions about how
multilingual models might leverage this mechanism. Specifically, for
autoregressive models trained on two or more languages, do all neurons (across
layers) respond equally to all languages? No! Our hypothesis centers around the
notion that during pretraining, certain model parameters learn strong
language-specific features, while others learn more language-agnostic (shared
across languages) features. To validate this, we conduct experiments utilizing
parallel corpora of two languages that the model was initially pretrained on.
Our findings reveal that the layers closest to the network's input or output
tend to exhibit more language-specific behaviour compared to the layers in the
middle.
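To make the key-value-memory reading of the feed-forward block concrete, the following sketch computes one token's FFN output as a weighted sum of value vectors, with the weights obtained by matching the input against the keys. The dimensions, random weights and ReLU gating are illustrative assumptions, not the configuration of any particular pretrained model.

```python
import numpy as np

# Minimal sketch of the "FFN as key-value memory" view for a single token.
d_model, d_ff = 8, 32                        # hidden size, number of "memories"
rng = np.random.default_rng(0)

x = rng.normal(size=d_model)                 # residual-stream input for one token
W_keys = rng.normal(size=(d_ff, d_model))    # each row: a key (pattern detector)
W_values = rng.normal(size=(d_ff, d_model))  # each row: a value (output direction)

# Key matching: how strongly each memory's pattern fires on this input.
coefficients = np.maximum(W_keys @ x, 0.0)   # ReLU-gated memory activations

# Value mixing: the FFN output is a coefficient-weighted sum of value vectors,
# added back to the residual stream, nudging the next-token distribution.
ffn_output = coefficients @ W_values

# A language-specific neuron would show coefficients that are consistently high
# for inputs in one language and near zero for the other.
print(ffn_output.shape)                      # (8,)
```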
HUME: Human UCCA-Based Evaluation of Machine Translation
Human evaluation of machine translation normally uses sentence-level measures
such as relative ranking or adequacy scales. However, these provide no insight
into possible errors, and do not scale well with sentence length. We argue for
a semantics-based evaluation, which captures what meaning components are
retained in the MT output, thus providing a more fine-grained analysis of
translation quality, and enabling the construction and tuning of
semantics-based MT. We present a novel human semantic evaluation measure, Human
UCCA-based MT Evaluation (HUME), building on the UCCA semantic representation
scheme. HUME covers a wider range of semantic phenomena than previous methods
and does not rely on semantic annotation of the potentially garbled MT output.
We experiment with four language pairs, demonstrating HUME's broad
applicability, and report good inter-annotator agreement rates and correlation
with human adequacy scores.
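As an illustration of how per-unit semantic judgements of this kind can be aggregated into a sentence-level score, the sketch below computes the fraction of source semantic units judged as preserved in the MT output. The unit names, label set and equal-weight aggregation are assumptions made for the example; HUME's exact annotation and scoring protocol are defined in the paper.

```python
from collections import Counter

# Hypothetical per-unit annotations for one sentence: each UCCA unit of the
# source is judged on whether its meaning survives in the MT output.
annotations = {
    "scene_1":        "preserved",
    "scene_1/A_dog":  "preserved",
    "scene_1/barked": "garbled",
    "scene_2":        "preserved",
    "scene_2/loudly": "omitted",
}

def hume_like_score(unit_labels: dict[str, str]) -> float:
    """Fraction of semantic units whose meaning is retained in the MT output."""
    counts = Counter(unit_labels.values())
    return counts["preserved"] / len(unit_labels)

print(f"{hume_like_score(annotations):.2f}")  # 0.60
```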
Results of the WMT15 Metrics Shared Task
This paper presents the results of the WMT15 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11
research groups. In addition to that, we computed scores of 7 standard metrics
(BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were
evaluated in terms of system level correlation (how well each metric's scores
correlate with WMT15 official manual ranking of systems) and in terms of segment
level correlation (how often a metric agrees with humans in comparing two
translations of a particular sentence).
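The two notions of correlation can be sketched as follows, with made-up numbers standing in for the real WMT15 systems and judgements: Pearson's r over per-system scores at the system level, and a Kendall-tau-style count of concordant versus discordant pairwise comparisons at the segment level.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative numbers only; the real evaluation uses the WMT15 submissions
# and the official human judgements.
human_system_scores  = np.array([0.62, 0.55, 0.48, 0.41])  # one value per MT system
metric_system_scores = np.array([28.1, 26.4, 25.9, 22.0])  # e.g. BLEU per system

# System-level correlation: does the metric rank whole systems like humans do?
r, _ = pearsonr(metric_system_scores, human_system_scores)
print(f"system-level Pearson r = {r:.3f}")

# Segment-level agreement (Kendall-tau-style): over pairs of translations of the
# same source sentence, count how often the metric agrees with the human
# preference. The pairs below are made up for illustration; ties are ignored.
pairs = [  # (metric score for translation A, for translation B, human-preferred side)
    (0.41, 0.35, "A"),
    (0.22, 0.30, "B"),
    (0.55, 0.60, "A"),
]
concordant = sum((a > b) == (winner == "A") for a, b, winner in pairs)
discordant = len(pairs) - concordant
tau = (concordant - discordant) / len(pairs)
print(f"segment-level tau = {tau:.3f}")
```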
The WMT'18 Morpheval test suites for English-Czech, English-German, English-Finnish and Turkish-English
Ten Years of WMT Evaluation Campaigns: Lessons Learnt
The WMT evaluation campaign (http://www.statmt.org/wmt16) has been run annually since 2006. It is a collection of shared
tasks related to machine translation, in which researchers compare their techniques against those of others in the field. The longest
running task in the campaign is the translation task, where participants translate a common test set with their MT systems. In addition
to the translation task, we have also included shared tasks on evaluation: both on automatic metrics (since 2008), which compare the
reference to the MT system output, and on quality estimation (since 2012), where system output is evaluated without a reference. An
important component of WMT has always been the manual evaluation, wherein human annotators are used to produce the official ranking
of the systems in each translation task. This reflects the belief of the WMT organizers that human judgement should be the ultimate arbiter
of MT quality. Over the years, we have experimented with different methods of improving the reliability, efficiency and discriminatory
power of these judgements. In this paper we report on our experiences in running this evaluation campaign, the current state of the art in
MT evaluation (both human and automatic), and our plans for future editions of WMT.
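As a schematic of how pairwise human judgements become a system ranking, the sketch below aggregates made-up preference judgements by simple win ratio. WMT itself has used more elaborate aggregation methods (e.g. expected wins and TrueSkill) in different years, and the system names and data here are invented, so this is illustrative only.

```python
from collections import defaultdict

# Each judgement records which of two systems a human annotator preferred
# for one sentence. The data is fabricated for the example.
judgements = [  # (better system, worse system)
    ("online-B", "uedin-syntax"),
    ("uedin-syntax", "online-A"),
    ("online-B", "online-A"),
    ("uedin-syntax", "online-B"),
]

wins, totals = defaultdict(int), defaultdict(int)
for better, worse in judgements:
    wins[better] += 1
    totals[better] += 1
    totals[worse] += 1

# Rank systems by the fraction of comparisons they won.
ranking = sorted(totals, key=lambda s: wins[s] / totals[s], reverse=True)
print(ranking)
```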
Edinburgh's Statistical Machine Translation Systems for WMT16
This paper describes the University of Edinburgh’s phrase-based and syntax-based
submissions to the shared translation tasks of the ACL 2016 First Conference on
Machine Translation (WMT16). We submitted five phrase-based and five
syntax-based systems for the news task, plus one phrase-based system for the
biomedical task.
Moses: Open Source Toolkit for Statistical Machine Translation
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
Findings of the WMT 2017 Biomedical Translation Shared Task
Automatic translation of documents is an important task in many domains, including the biological and clinical domains. The second edition of the Biomedical Translation task in the Conference on Machine Translation focused on the automatic translation of biomedical-related documents between English and various European languages. This year, we addressed ten languages: Czech, German, English, French, Hungarian, Polish, Portuguese, Spanish, Romanian and Swedish. Test sets included both scientific publications (from the Scielo and EDP Sciences databases) and health-related news (from the Cochrane and UK National Health Service web sites). Seven teams participated in the task, submitting a total of 82 runs. Herein we describe the test sets, participating systems and results of both the automatic and manual evaluation of the translations.